feat: Add range partitioning support#174
Conversation
hiroyuki-sato
left a comment
There was a problem hiding this comment.
Thank you for creating this PR. I'll review this later. Before review this PR, I have a question. Could you check my comment?
| range_partitioning: | ||
| field: customer_id | ||
| range: | ||
| start: '1' |
There was a problem hiding this comment.
Why does this part use string instead of integer? As far as I know, range partition uses a number. And we can check the range is valid (start < end) if we use integer.
What do you think of this configuration layout?
(Do we need a range block?).
range_partitioning:
field: customer_id # document uses `column` but this plugin uses `field`. [1]
start: 1
end: 1000
interval: 10
# [1] https://cloud.google.com/bigquery/docs/creating-partitioned-tables#create_an_integer-range_partitioned_table(This is just my opinion, I want to ask co-maintainers this comment.)
There was a problem hiding this comment.
Thank you for the comment! I followed the api documentation. If you prefer integer, I'll change this!
https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#RangePartitioning
There was a problem hiding this comment.
Do we need a range block?
I fixed it with 49d2c36.
There was a problem hiding this comment.
Hello, @kitagry. Thank you for waiting.
Could you use this layout? (Sorry, we decided to use the original design (except using integer instead of string in range values.)
range_partitioning:
field: customer_id
range:
start: 1 # integer not string.
end: 99999 # integer not string.
interval: 1 # integer not string.and Could you check the range start + interval < end?
If you have any concern, please let me know.
I referenced the time_partition configurations.
BigQuery API use
{
"type": string,
"expirationMs": string,
"field": string,
"requirePartitionFilter": boolean
}embulk configuration
type: bigquery
table: table_name$20160929
time_partitioning:
type: DAY
expiration_ms: 259200000 # integer not strong, use sake caseWe discussed this using this design document. (Written in Japanese)
After modification, I'll check the partition feature.
|
@kitagry Thank you for more work on this PR. I'll review this PR please wait. I'm not the original plugin developer. So, I need to investigate the configuration rule. If this plugin is based on the API settings, It would be better to use the original |
49d2c36 to
f8d039f
Compare
|
Hi @hiroyuki-sato , I changed range-partitioning fileds to be integer in f8d039f |
f8d039f to
5398f69
Compare
|
@kitagry Thanks. I will test this PR later. Please wait for a while. |
|
@kitagry Thank you for waiting. I'm so sorry, but I can't get the review time. I'll review this in April. |
|
@hiroyuki-sato Hi. Could you review this? |
There was a problem hiding this comment.
Hello, @kitagry
Sorry for the late review. I was able to confirm that the main part of this PR works. I was able to create a partition table using this dummy data.
I added the id field using the following command.
I have some comments. Could you take a look?
nkf -w input.csv | awk -F, '{ print NR-1","$0 }' > curry.csv
in:
type: file
path_prefix: ./curry.csv
parser:
charset: UTF-8
newline: LF
type: csv
delimiter: ','
quote: '"'
escape: '"'
trim_if_not_quoted: false
skip_header_lines: 1
allow_extra_columns: false
allow_optional_columns: false
columns:
- {name: id, type: long}
- {name: name, type: string}
- {name: kana, type: string}
- {name: address, type: string}
- {name: sex, type: string}
- {name: age, type: long}
- {name: birthday, type: timestamp, format: '%Y/%m/%d'}
- {name: marriage, type: string}
- {name: prefecture, type: string}
- {name: mobile, type: string}
- {name: carrier, type: string}
- {name: curry, type: string}
out:
type: bigquery
mode: append
project: project
dataset: dataset
auth_method: service_account
json_keyfile: key.json
auto_create_table: true
table: part_test
location: asia-northeast1
range_partitioning:
field: id
range:
start: 1
end: 5001
interval: 500|
Thank you for the review. I fixed. |
hiroyuki-sato
left a comment
There was a problem hiding this comment.
@kitagry Thank you for your work. I left two comments. Could you tell me your opinion?
| end | ||
| elsif Helper.has_partition_decorator?(task['table']) | ||
| # If user specify range_partitioning, it should be used as is | ||
| elsif Helper.has_partition_decorator?(task['table']) && task['range_partitioning'].nil? |
There was a problem hiding this comment.
Could you elaborate on this part? IIUC Helper.has_partition_decorator?(task['table']) returns true if the table name contains $. why you add task['range_partitioning'].nil??
I can't understand If user specify range_partitioning, it should be used as is means.
In my opinion, these checks are preferable.
# can't use two partitions config at the same time.
if task['time_partitioning'] && task['range_partitioning']
raise ConfigError.new ...
end
# partition decrator doesn't support range_partition (if needed)
if Helper.has_partition_decorator?(task['table']) && task['range_partitioning']
raise ConfigError.new ...
endThere was a problem hiding this comment.
Thank you for the review. I think that the following code would be preferable.
if Helper.has_partition_decorator?(task['table'])
task['time_partitioning'] = {'type' => 'DAY'}
end
if task['time_partitioning'] && task['range_partitioning']
raise ConfigError.new ...
endThere was a problem hiding this comment.
Hmm, the error message would be confusing. I fix like e9cc945. Could you check this?
There was a problem hiding this comment.
(This is just a double check.) the error message means If user specify range_partitioning, it should be used as is?
There was a problem hiding this comment.
When I use like following. Users who use partition decorator may be confused, because he don't use time_partitioning.
if Helper.has_partition_decorator?(task['table'])
task['time_partitioning'] = {'type' => 'DAY'}
end
if task['time_partitioning'] && task['range_partitioning']
raise ConfigError.new "`time_partitioning` and `range_partitioning` cannot be used at the same time"
end
There was a problem hiding this comment.
Oh, you mean if the user uses the following configuration, the user may confuse the storage message. correct?
If so, your proposed PR seems good.
table: part_test$200105
range_partitioning:
field: id
range:
start: 1
end: 5001
interval: 500
1e0f108 to
e9cc945
Compare
hiroyuki-sato
left a comment
There was a problem hiding this comment.
@kitagry Thank you for fixing the PR. Basically, LGTM. Could you check two minor comments?
|
@kitagry Thank you for your work. LGTM. I'll re-check this change and request co-maintainer review. |
hiroyuki-sato
left a comment
There was a problem hiding this comment.
@kitagry, could you fix the test and check my comments? I'll approve this PR after fixing the test. (I'll request a check co-maintainer)
|
Thank you for the comment. I'll see tomorrow 🙏 |
Co-authored-by: Hiroyuki Sato <hiroysato@gmail.com>
de9c8be to
068acba
Compare
hiroyuki-sato
left a comment
There was a problem hiding this comment.
@kitagry LGTM👍
@joker1007 Could you take a look this PR?
|
@joker1007 Thanks! |

This pull request introduces support for range partitioning in BigQuery.
I checked this feature with
example/config_replace_field_range_partitioned_table.yml